ABSTRACT

A 30-item multiple-choice word analogy test and a
corresponding 30-item picture analogy test (in which the pictures 
corresponded to the words in the word analogy test) were administered 
to 289 Civil Service employees. The equivalence of semantic (word) 
and figural (picture) test presentation of the same items was 
determined by comparing the responses of the same subject to the same 
item. Proportion of correspondent responses (both correct or both 
wrong) ranged from .69 to .91 with a median of .84. Correlation 
between scores on the two test forms was .86. Over 84% of the 
subjects gave correspondent responses with greater than chance 
frequency. Score distributions were practically identical. It was 
concluded that semantic and figural parallel test forms can be 
constructed. (Author) 

The Equivalence of Semantic and Figural
Test Presentation of the Same Items

Howard E. A. Tinsley
and
René V. Dawis

In 1962, several books were published which, along with a subsequent
awakening to the realities of racism in America, were helpful in focusing
public attention on the cultural fairness of psychological tests (Black,
1962; Gross, 1962; Hoffman, 1962). A great deal of discussion followed
regarding the feasibility and desirability of developing "culture-fair"
tests (Anastasi, 1967; Ash, 1967; Doppelt and Bennett, 1967; Krug, 1964;
and Wesman, 1968). The research evidence bearing on this question is mixed.

A number of studies on the prediction of educational achievement have
indicated that validity coefficients for black and other culturally
disadvantaged students are as high or higher than validity coefficients for white
or culturally advantaged students (Cleary, 1966; Hewer, 1965; Hills, 1964;
Roberts, 1964; and Stanley and Porter, 1967); similar findings have been
observed in at least one study of the prediction of vocational criteria
(Gordon, 1953). On the other hand, a series of studies with employees of
the Port of New York Authority (Lopez, 1966) revealed different relationships
between predictors and job performance criteria for black and white toll
collectors and maintenance men. Moreover, an extensive investigation by the
Research Center for Industrial Behavior at New York University (Kirkpatrick,
Ewen, Barrett, and Katzell, 1968) indicated that many tests performed equally
well in different ethnic groups but that in some cases different tests worked
best for different groups. Inclusion of an index of cultural disadvantage as
a moderator variable improved test validity for some jobs. Included in this
study were some 1200 persons: white, black, and Puerto Rican clerical workers,
nursing students, and participants in job training programs for maintenance
work and heavy equipment operation.

Anastasi (1964) has pointed out that in designing "culturally fair"
tests, it is important to distinguish between those cultural factors that
affect both the test and criterion behavior, and those that influence only
the test behavior. The former are necessary to insure the validity of the
test. It is the "test-specific cultural factors" which "culturally bias" a
test. Research with the mentally retarded indicates that verbal ability may
constitute one such biasing factor in tests designed to predict vocational
criteria. A number of investigators have reported that the diagnosis of a
person as mentally retarded on the basis of his performance on a verbal test
is vocationally meaningless (Bobroff, 1956; Collman and Newlyn, 1957; Kauppi,
1960; Kauppi and Weiss, 1967; Muench, 1944; and Seashore, Wesman and Doppelt,
1950). In general, the evidence indicated that the vocational adjustment of
the mentally retarded was far too heterogeneous and showed far too much overlap
with that of non-retarded workers for the diagnosis of a person as mentally
retarded to have valid implications for vocational success. Kauppi and Weiss
(1967, p. 340) concluded that "Knowing that a client is mentally retarded
tells the counselor only that he is probably below average on verbal tasks.
The label says little about other abilities, interests, needs or potential."

Other researchers have also indicated the desirability of eliminating
the verbal ability "bias" found in tests. The United States Employment
Service (Jurgensen, 1966) has experimented with non-reading forms of several
GATB tests. Rimland (1967) has suggested the use of the Porteus Maze test,
a non-verbal test of general mental ability. Freeberg (1970) has
experimented with verbal tests and with tests in which pictorial information is
accompanied by verbal information, with primary emphasis given to making
the tests consist of more culturally relevant stimulus materials. And Krug
(1964) has suggested the use of biographical information and situation tests.

Guilford's (1956, 1959) Structure of Intellect provides the best 
theoretical framework within which to pursue this discussion. His model 
represents the human intellect as a cube having "content," "process," 
and "product" dimensions. Each cell in the cube represents a unique 
factor (or set of factors) of intellectual ability. Thus, a semantic test 
(one which uses words to ask questions, thereby requiring verbal ability) 
supposedly measures a factorially different ability than a figural test 
(one which uses pictures to ask questions). Because they measure different 
factors, the two types of tests may be differentially related to many
criteria. Guilford (1959) has suggested that the abilities involved in
using figural (picture) information are most closely related to success as
a mechanic, machine operator, artist, or musician, and are related to
success in certain aspects of engineering, while the abilities involved in 
using semantic (verbal) information are most closely related to success in 
educational settings where the learning of verbally presented facts and 
ideas is essential. 

There is a great deal of evidence, then, to suggest that a test is 
culturally "biased" only because the test measures some culturally related 
factor which does not influence the criterion behavior, e.g., a verbal 
ability component may "bias" tests used to predict criteria not influenced 
by verbal ability. The work of Guilford indicates that semantic (verbal) 
and figural (picture) tests measure different factors. It is possible, 
therefore, that figural tests may operate as unbiased predictors in those 
instances where semantic tests are culturally biased. The choice between 
semantic and figural tests, however, represents a kind of "all-or-none" 
choice. Because the semantic and figural tests used in past research
have often been developed independently without any attempt to make them 
equivalent, they may differ in a number of respects, only some of which 
are related to verbal ability. It is important to identify those factors 
which contribute to differences in test scores on the two types of tests. 
This will allow the elimination from a test of those factors which have 
a biasing effect and the retention of those factors which contribute to the 
predictive validity of the test. 

Heim (1954) has suggested two variables related to the structure of
test items which he believes have an effect on the difficulty of the item.
First, Heim has presented evidence that the type of question format
(multiple choice or inventive answer) has a bearing on the difficulty of
the item; Guilford (1959) has demonstrated that such questions measure
factorially different abilities. Second, Heim (1954) has suggested that
differences in the internal structure of an item might influence item 
difficulty. Ace and Dawis (in press) have demonstrated this to be true 
under certain conditions. 

Of more interest to the present authors are three item characteristics 
suggested by Spearman (1927), who observed that the complexity,
abstractness, and novelty of test items seemed to be the factors important in
determining their difficulty. The present authors believe that if such
variables as test instructions, time limits, administrator comments, item
format, and the internal structure of the item are held constant, four
factors may still operate to produce differences in the scores obtained
on semantic (verbal) and figural (picture) tests. First, such tests may
differ in the level of abstraction they require. Many concepts are easy
to express verbally but are extremely difficult to express in a picture.
Examples include emotions (love, hate, affection), time other than present 
(past, future) and degree (better, best). Unless semantic and figural 
tests are equated for the level of abstraction, it is likely that the 
semantic test will require a higher level of abstract thinking. This may 
be detrimental if the criterion behavior is not related to ability in 
abstract thinking. Secondly, semantic and figural tests may differ in 
their "novelty". It seems likely that many respondents will find one type 
of test stimulus more familiar than the other. To do well on a semantic 
test requires a familiarity with the words used (a good vocabulary) while 
achievement on a figural test requires familiarity with the appearance of 
objects. Few people, for example, would recognize the word "ibex," yet 
most would recognize a picture of a wild mountain goat. Conversely, few 
people could identify a clutch or brake drum from a picture but many have 
those words in their vocabulary. A third way in which semantic and figural 
tests may differ is in their complexity. Campbell (1961) has reported that
the effects of complexity (defined as the number of item properties to be
taken into consideration in arriving at the correct answer) on the difficulty
of symbol classification are due primarily to the nature of the classifying
property. Classification by shape led to the least item difficulty;
classification by size led to the most difficulty. Finally, semantic and 
figural tests usually represent different samples of test behavior. (This 
item characteristic is referred to hereafter as the item content, but should 
not be confused with Guilford's notion of "content," which refers to the type
of stimulus material used to present the item: words, pictures, numbers, or
symbols.) Even when two semantic tests have been designed to be parallel
measures of the same ability, they often do not yield identical ability 
estimates. Most semantic and figural tests have not been designed to be 
parallel, so differences in ability estimates are to be expected. 

The present authors hypothesize, then, that the differences observed
by Guilford (1959) in semantic and figural tests are due to differences in 
the abstractness, novelty, complexity, and content of the test items. Two 
tests which have been equated for these factors should yield roughly 
equivalent scores even though one uses semantic items while the other uses 
figural items. The remainder of this paper is concerned with an investigation 
of this question. 

METHOD 

Instrumentation: The analogy question format was selected for this research
because of its wide use in tests and because analogy tests seem to represent
Spearman's "g" more closely than other tests (Helmstadter, 1964, p. 99). A
list of relationships which could be expressed in analogy format was compiled 
and used as a guide in constructing a pool of 100 picture analogies. A set 
of 30 picture analogies which included most of the pictures used in the 100 
picture analogies was administered to a group of 46 college students. In 
addition to completing the analogy, the subjects were asked to identify the 
object in each picture. Most pictures were identified by greater than 90% 
of the subjects. Those pictures which were correctly identified by fewer 
than 80% of the students were discarded and new pictures were taken to
represent the concept.¹ The total item pool of 100 analogies was then
administered to 301 college students and the 30 picture analogies having the
highest point-biserial correlation with total score were selected for further 
study. Next, thirty word analogies were constructed by expressing each 
picture analogy in word form, thus pairing every picture analogy with a word 
analogy of identical content. Because the items were so exactly paired in
terms of content, it is assumed that both items in each of the 30 item pairs
were also equivalent in abstractness and complexity. The 60 analogy items
were then combined in an instrument with the 30 word analogies first. For
each type of analogy, the order of presentation was randomized.

1 The authors wish to express their gratitude to Mr. Merle Ace, University
of British Columbia, who supervised the construction of the 100 picture
analogies and the analysis of the recognizability of the objects in the
pictures.
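
The item-selection step described above (keeping the 30 picture analogies with the highest point-biserial correlation with total score) can be sketched as follows. This is only an illustration under stated assumptions: the function names `point_biserial` and `select_items` are hypothetical, responses are assumed to be scored 0/1, and the total score is assumed to include the item itself, since the report does not say whether a corrected total was used.

```python
import math
from statistics import mean, pstdev


def point_biserial(item, totals):
    """Point-biserial correlation between a 0/1 item and total scores."""
    n = len(item)
    p = sum(item) / n                 # proportion answering the item correctly
    sd = pstdev(totals)               # population SD of the total scores
    if p in (0.0, 1.0) or sd == 0:
        return 0.0                    # undefined; treat the item as uninformative
    mean_right = mean(t for i, t in zip(item, totals) if i == 1)
    mean_wrong = mean(t for i, t in zip(item, totals) if i == 0)
    return (mean_right - mean_wrong) / sd * math.sqrt(p * (1 - p))


def select_items(responses, k):
    """Return the indices of the k items with the highest point-biserial.

    responses: one list of 0/1 item scores per subject (assumed layout).
    """
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    r = [point_biserial([row[j] for row in responses], totals)
         for j in range(n_items)]
    ranked = sorted(range(n_items), key=lambda j: r[j], reverse=True)
    return sorted(ranked[:k])


# Toy data: 4 subjects, 3 items; items 0 and 1 track the total score best.
toy = [[1, 1, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0]]
chosen = select_items(toy, 2)
```

With the 100-item pool of the study, the call would be `select_items(responses, 30)`.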

Both tests were designed to minimize the novelty of the items. The 
pictures in the picture analogies were of commonplace items although the 
relationship expressed by the analogy was often complex. The object in each 
picture was correctly identified by at least 80% of college students. All but 16 of
the words in the word analogies appear on the Dale and Chall (1948) list of
3000 words familiar to 80% of fourth graders. Because the novelty of the
picture items was judged from the responses of college students while the
novelty of the word analogies was judged from the responses of fourth grade
students, the items may be imperfectly equated for their novelty, with the
picture analogies containing the more novel stimuli.

Subjects: The tests were administered to 289 Minneapolis Civil Service
employees as part of a battery of tests. Twenty subjects were dropped for
failure to respond to all of the items. The remaining 269 subjects were
predominantly white (96%) females (97%) who ranged in age from 18 to 64.
The median age was 33; the modal ages were 20 and 21; 42% of the sample was
26 years of age or younger. All but 4 subjects had a high school education,
17% had some college, and 3.4% had a college degree. The median family
income was $9000 per year; the modal income was $10,000 per year.

Analysis: This research was concerned with the question of whether semantic 
and figural analogy items of equivalent abstractness, novelty, complexity, 
and content would yield equivalent results. Analyses were performed at the 
item and the test level. Because each question had five alternatives, the 
expected chance probability of a subject's correctly answering both items 
in the pair was .04 (.2 x .2), the expected chance probability of his
incorrectly answering both items in the pair was .64 (.8 x .8), and the expected
chance probability of correspondent responses (both responses correct or both
responses incorrect) was .68. Accordingly, a 1-tailed z test was performed
for each of the 30 pairs of items to determine whether the proportion of
correspondent responses was significantly greater than .68. This represented
an extremely stringent test of the hypothesis, however, as measurement at
the single item level is seldom precise. At the test level, the data were
analyzed as two 30-item analogy tests. A 2-tailed t-test was performed to
determine whether the mean total scores on the semantic and figural forms
were equivalent, an F test was performed to determine whether the variances
of the total scores on the two forms were equivalent, and the product-moment
correlation was computed between the total scores on the two forms.
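
The item-level test can be illustrated with a short sketch. It assumes the usual normal approximation for a single proportion, which the report does not spell out, and the function names are hypothetical. The count of 195 correspondent responses among 269 subjects is the figure given for item 15 in Table 1.

```python
import math


def chance_correspondence(n_alternatives):
    """Chance rate of both-correct or both-wrong on a pair of items."""
    p = 1.0 / n_alternatives          # chance of a correct guess on one item
    return p * p + (1 - p) * (1 - p)  # .04 + .64 for five alternatives


def z_proportion(count, n, p0):
    """z statistic for an observed proportion count/n against p0
    (normal approximation; an assumption about the authors' procedure)."""
    se = math.sqrt(p0 * (1 - p0) / n)
    return (count / n - p0) / se


def upper_tail_p(z):
    """One-tailed p-value from the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))


p0 = chance_correspondence(5)         # about .68
z_item15 = z_proportion(195, 269, p0) # item 15 of Table 1
p_item15 = upper_tail_p(z_item15)
```

For item 15 the z statistic comes out near 1.6, short of the 1-tailed .05 critical value of 1.645, which agrees with that item being flagged as not significant.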

The above analyses indicated the extent to which the semantic and figural 
items yielded statistically equivalent or correspondent results for the total 
sample. Also of interest was the degree to which the two types of items 
yielded equivalent measurement for each individual. A 2-tailed z test was 
performed for each of the 269 subjects to determine whether the proportion 
of correspondent responses to the item pairs departed significantly from the 
expected chance rate of .68. This, again, is a somewhat stringent test.

Even the most rigorously developed of parallel forms will not yield identical
scores for all subjects. It is justifiable, therefore, to ask whether the
observed differences in scores can be explained in terms of the error of
measurement. To answer this question, standardized difference scores were
computed for each subject. First, the standard error of measurement of the
picture form was computed from the item analysis data. Then the difference
between the total scores for each person on the word and picture analogies
was expressed as a proportion of this standard error of measurement. Wright
(1967), in commenting on this procedure, points out that if the variation in
scores is of the same magnitude as that expected from the error of
measurement of the test, then the distribution of standardized difference scores
should have a mean of zero and a standard deviation of 1.0.
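
A minimal sketch of the standardized difference score follows. Two points are assumptions rather than details given in the report: the standard error of measurement is taken to be SD x sqrt(1 - reliability), with the reliability estimate supplied by the caller, and the difference is taken as word score minus picture score. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def sem_from_reliability(sd, reliability):
    """Standard error of measurement from a test SD and a reliability
    coefficient (classical formula; assumed, not stated in the report)."""
    return sd * math.sqrt(1 - reliability)


def standardized_differences(word_totals, picture_totals, sem_picture):
    """Word-minus-picture difference for each subject, in SEM units."""
    return [(w - p) / sem_picture
            for w, p in zip(word_totals, picture_totals)]


def summarize(diffs):
    """Mean and SD of the standardized differences. Values near 0 and 1.0
    suggest the differences are no larger than measurement error."""
    return mean(diffs), pstdev(diffs)


# Toy example with made-up scores for three subjects.
sem = sem_from_reliability(5.0, 0.84)            # about 2.0
diffs = standardized_differences([14, 12, 10], [12, 12, 13], 2.0)
center, spread = summarize(diffs)
```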



RESULTS 

For each of the 30 pairs of analogy items, the 1-tailed z test was 
employed to determine whether the proportion of subjects making correspondent 
responses significantly exceeded the proportion expected by chance. The 
proportion of correspondent responses was significantly greater than chance 
at the .005 level of confidence for 27 items; the 3 remaining items failed 
to achieve significance at the .05 level of confidence (see Table 1). 

Insert Table 1 about here 

The data were also analyzed to determine whether the 30 semantic and
30 figural analogies could be regarded as parallel forms having equal means
and equal variances. The mean total score was 13.0 for the semantic
analogies and 13.2 for the figural analogies; the variances were 24.4 and
28.3 respectively. Neither the two sample t-test for the difference between
means (t = .45, df = 268) nor the F-test for homogeneity of variance (F = 1.16,
df = 268, 268) was significant at the .05 level of confidence. The correlation
between scores on the semantic and figural forms was .86.
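
The reported t and F values can be checked from the summary statistics alone. The sketch below treats the two forms as independent samples of 269 scores each; that is an assumption, adopted because it reproduces the reported figures, and the function names are hypothetical.

```python
import math


def two_sample_t(mean1, var1, n1, mean2, var2, n2):
    """t statistic for the difference between two means, using the
    standard error sqrt(var1/n1 + var2/n2) (independent-samples form)."""
    se = math.sqrt(var1 / n1 + var2 / n2)
    return (mean2 - mean1) / se


def variance_ratio(var1, var2):
    """F ratio with the larger variance in the numerator."""
    return max(var1, var2) / min(var1, var2)


# Means and variances reported for the semantic and figural forms.
t = two_sample_t(13.0, 24.4, 269, 13.2, 28.3, 269)
f = variance_ratio(24.4, 28.3)
```

Under these assumptions t is about .45 and F about 1.16, in line with the values in the text. Because the same subjects took both forms, a paired analysis would be an alternative; the report does not present one.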

Table 1

Number and Proportion of Correspondent Responses for Thirty
Semantic Analogy-Figural Analogy Pairs
(N=269)

Item      N      Proportion of correspondent responses
  1      217     .80
  2      186     .69*
  3      206     .76
  4      218     .81
  5      205     .76
  6      230     .85
  7      205     .76
  8      226     .84
  9      214     .79
 10      218     .81
 11      230     .85
 12      228     .84
 13      208     .77
 14      220     .82
 15      195     .72*
 16      240     .90
 17      246     .91
 18      209     .77
 19      236     .87
 20      246     .91
 21      204     .76
 22      230     .85
 23      228     .84
 24      227     .84
 25      229     .85
 26      228     .84
 27      189     .70*
 28      228     .84
 29      235     .87
 30      226     .84

* Not significant at the .05 level. All other pairs are
significantly correspondent at the .005 level.

For each subject, a 2-tailed z test was performed to determine whether
the proportion of correspondent responses made by that person was
significantly different from the proportion that could be expected by chance.
In order for the proportion of correspondent responses to exceed significantly
the proportion expected by chance (.68), the subject needed to make at least 26
(86.7%) correspondent responses; 125 (46.5%) of the 269 subjects fell in
this category. Only 6 (2.2%) of the examinees made significantly fewer
than chance (15 or less) correspondent responses; the remaining 138 subjects
fell in the chance range. In this latter group, 102 (73.9%) made
correspondent responses to more than 68% of the item pairs. In all, then, 227
(84.4%) subjects made correspondent responses with greater than chance
frequency while 42 (15.6%) made correspondent responses with less than
chance frequency.
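
The cutoffs of 26 and 15 correspondent responses follow from the 2-tailed z test on 30 item pairs. A minimal sketch, assuming the normal approximation with a critical value of 1.96 (the .05 level) and using a hypothetical function name:

```python
import math


def critical_counts(n_items, p0, z_crit=1.96):
    """Largest count significantly below chance and smallest count
    significantly above chance, for a 2-tailed z test on a proportion."""
    half_width = z_crit * math.sqrt(p0 * (1 - p0) / n_items)
    upper = math.floor((p0 + half_width) * n_items) + 1
    lower = math.ceil((p0 - half_width) * n_items) - 1
    return lower, upper


lower, upper = critical_counts(30, 0.68)
```

With 30 pairs and a chance rate of .68, the upper bound of the chance range is about 25.4 pairs and the lower bound about 15.4, so `lower` is 15 and `upper` is 26.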

The standard error of measurement for the figural analogy test (as 
computed from the item analysis data) was 2.32. The mean and standard 
deviation of the distribution of standardized difference scores were -.08 
and 1.18 respectively. 

DISCUSSION 

The analyses at both the item and the test level support the conclusion
that the semantic and figural analogies used in this study were measuring
the same trait. At the item level, correspondent responses occurred at a
significantly greater than chance frequency for 27 of the 30 analogies.
An analysis of the three discrepant item pairs suggests that the quality
of the pictures may account for their failure to support this conclusion.
The analogies in question read:

Peanut : _____ :: Lettuce : Cabbage

1. Plowed field 2. Butter 3. Potato 4. Radish 5. Beans

Carrot : _____ :: Orange : Innertube

1. Block 2. Alligator 3. Canoe 4. Fire 5. Telephone

_____ : Hinge :: Arm : Elbow

1. Handle 2. Door knob 3. Door frame 4. Desk leg 5. Door

In the picture form of the first analogy, the peanut, the radish, and
the beans are particularly difficult to identify. In the picture form of
the second analogy, the innertube looks like a ring bologna or a sausage
and the block looks like a bar of soap. Incidentally, these five pictures
did not appear in the 30 picture analogies employed for the evaluation of
picture clarity. In the third analogy, the correct answer is door. It
seems likely that the distinction between door and door frame is not as
clear in the picture form as it is in the word form of the analogy. It
was concluded, therefore, that the proportion of correspondent responses
failed significantly to exceed the proportion expected by chance because
of the lack of clarity of some of the pictures used in these analogies.

At the test level, the distributions of scores on the semantic and
figural tests were practically the same. The figural test scores were
slightly more variable than the semantic test scores but the difference was
not significant. The product-moment correlation between scores on the two
forms (r = .86) was high considering the experimental nature of the two forms.
All of the evidence indicates that the two tests are measuring the same
trait.

The distribution of standardized difference scores also supports this
conclusion. The data indicate that most of the differences observed in
scores on the two forms can be attributed to errors of measurement. The
small amount of difference score variance remaining after the variance
attributable to errors of measurement has been removed may well be due to
differences in the novelty of the stimuli or to the use of uninterpretable
pictures.

The above conclusions are based upon data from the entire sample. An
analysis of the data for one individual at a time leads to essentially the
same conclusion. Over 84% of the subjects gave more correspondent responses
than would have been predicted on the basis of chance responding. Only
2.2% gave significantly fewer correspondent responses than expected on the
basis of chance.

These results, then, indicate that the distinction between semantic 
and figural tests needs to be examined more closely. Semantic and figural 
"parallel forms" can be constructed. This implies that the differences
which have been observed in performance on such tests are not necessarily
the result of differences in the stimulus material (pictures and words),
but can be the result of other characteristics which usually covary with
stimulus differences. The present authors suggest that the abstractness,
novelty, complexity, and content of the items may be the most meaningful
dimensions on which these items vary. Research effort on "culture-fair" tests
may be more profitably spent in investigating these dimensions than in
comparing semantic (word) and figural (picture) tests.